/* ************************************************************************
 * Copyright 2013 Advanced Micro Devices, Inc.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 * http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 * ************************************************************************/

/*! @file clFFT.mainpage.h

This file contains all documentation, no code, in the form of comment text.  Its purpose is to provide
chapter 1 of the documentation we produce with doxygen.  This includes the title page, installation instructions,
and prose on the nature of FFTs and their use in our library.

@mainpage OpenCL Fast Fourier Transforms (FFTs)

The clFFT library is an OpenCL library implementation of discrete Fast Fourier Transforms. It:
@li Provides a fast and accurate platform for calculating discrete FFTs.
@li Works on CPU or GPU backends.
@li Supports in-place or out-of-place transforms.
@li Supports 1D, 2D, and 3D transforms with a batch size that can be greater than 1.
@li Supports planar (real and imaginary components in separate arrays) and interleaved (real and imaginary
components as a pair contiguous in memory) formats.
@li Supports dimension lengths that can be any mix of powers of 2, 3, and 5.
@li Supports single and double precision floating point formats.

@section InstallFFT Installation of clFFT library

@subsection DownBinaries Downloadable Binaries
AMD provides clFFT library pre-compiled packages for recent versions of Microsoft Windows operating systems
and several flavors of Linux.

The downloadable binary packages are freely available from AMD at
http://developer.amd.com/tools-and-sdks/heterogeneous-computing/amd-accelerated-parallel-processing-math-libraries/

Once the appropriate package for the respective OS has finished downloading,
uncompress the package using the native tools available on the platform in a
directory of the user's choice. Everything needed to build a program using
clFFT is included in the directory tree, including documentation, header files,
binary library components, and sample programs for programming illustration.

@subsubsection CMakeDependancy CMake
After the clFFT package is uncompressed on the user's hard drive, a samples directory exists with source code,
but it contains no Visual Studio project files, Unix makefiles, or other native build files. Instead, it contains a
\c CMakeLists.txt file. clFFT uses CMake as its build system; other build files, such as Visual Studio projects,
nmake makefiles, or Unix makefiles, are generated by CMake during configuration. CMake is freely
available for download from: http://www.cmake.org/

@note CMake generates the native OS build files, so any changes made to the native build files are overwritten the
next time CMake is run.

CMake is written to pull compiler information from environment variables, and to look in default install
directories for tools. Once installed, a popular interface to control the process of creating native build
files is CMake-gui. When the GUI is launched, two text boxes appear at the top of the dialog: a path to
source and a separate path to generate binaries. For the \c browse source... box, find the path to where you
unzipped clFFT, and select the root \c samples directory that contains the CMakeLists.txt; for clFFT,
this should be \c clFFT/samples.  For \c browse \c build..., select an appropriate directory where the build
environment generates build files; a convenient location is a sibling directory to the source. This makes
it easy to wipe all the binaries and start a fresh build. For instance, for a debug configuration of NMake,
an example directory could be \c clFFT/bin/NMakeDebug. This is where the generated makefile, native build
files, and intermediate object files are built. These generated files are kept separate from the source;
this is referred to as 'out-of-source' builds, and is very similar in concept to what 'autotools' does for Linux.
To build using NMake, simply type NMake in the build directory containing the makefile. To build using
Visual Studio, generate the solution and project files into a directory such as \c clFFT/bin/vs10, find the
generated \c .sln file, and open the solution.

The first time the \c configure button near the bottom of the screen is clicked, it causes CMake to prompt for
what type of native build files to make. Various properties appear in red in the \c properties box. Red indicates
that the value has changed since last time \c configure was clicked. (The first time configure is clicked,
everything is red.) CMake tries to configure itself automatically to the client's system by looking at the system's
environment variables and by searching through default install locations for project dependencies. Take a moment to
verify the settings and paths that are displayed on the configuration screen; if any changes must be made, you can
provide correct paths or adjust settings by typing directly into the CMake configuration screen. Click the
\c configure button a second time to 'bake' those settings and serialize them to disk.

Options relevant to the clFFT project include:

@li \c 'AMDAPPSDKROOT': Location of the Stream SDK installation. This value is already populated if CMake
could determine the location by looking at the environment variables. If not, the user must provide a path to
the root installation of the Stream SDK here.

@li \c 'BOOST_ROOT':  Location of the Boost SDK installation. This value is already populated if CMake could
determine the location by looking at the environment variables or default install locations. If not, the user must
provide a path to the root installation of Boost here. This dependency is only relevant to the sample
client; the FFT library does not depend on Boost.

@li \c 'CMAKE_BUILD_TYPE':  Defines the build type (default is debug). For Visual Studio projects, this does
not appear (modifiable in IDE); for makefile-based builds, this is set in CMake.

@li \c 'CMAKE_INSTALL_PREFIX':  The path to install all binaries and headers generated from the build. This is
used when the user types \c make \c install or builds the INSTALL project in Visual Studio. All generated binaries and
headers are copied into the path prefixed with \c CMAKE_INSTALL_PREFIX.

The Visual Studio projects are self-explanatory, but a few other projects are autogenerated; these might be unfamiliar.

@li \c 'ALL_BUILD': A project that is empty of files, but since it depends on all user projects, it provides a
convenient way to rebuild everything.

@li \c 'ZERO_CHECK':  A CMake-specific project that checks to see if the generated solution and project files are in sync
with the \c CMakeLists.txt file. If these files are modified, the solutions and projects are now out-of-sync, and this
project prompts the user to regenerate their environment.

@note If the user chooses to build on Windows with an NMake-based build, it is important to launch CMake from within a
Visual Studio Command Prompt (20xx).  This is because CMake must be able to parse environment variables to properly
initialize NMake. This is not necessary if a Visual Studio solution is generated, because solution files contain their
own environment setup.

@subsubsection BoostDependancy Boost
clFFT includes one sample project that has source dependencies on Boost: the sample client project. Boost is
freely available from:  http://www.boost.org/.

The command-line clFFT sample client links with the \c program_options library, which provides functionality for
parsing command-line parameters and \c .ini files in a cross-platform manner. Once Boost is downloaded and
extracted on the hard drive, the \c program_options library must be compiled. The Boost build system
uses the BJam builder (a project for a CMake-based Boost build is available for separate download). BJam is
available for download from the Boost website, or the user can build it from source: Boost includes the BJam source
in its distribution, and the user can execute \c bootstrap.bat (located in the root boost directory) to build it.

After BJam is either built or installed, an example BJam command-line is given below for building a 64-bit
\c program_options binary, for both static and dynamic linking:
@code
bjam --with-program_options address-model=64 link=static,shared stage
@endcode

The last step to make Boost readily available and usable by CMake and the native compiler is to add an environment
variable to the system called \c BOOST_ROOT. In Windows, right click on the computer icon and go to
@code
'Properties|Advanced system settings|Advanced|Environment Variables...'
@endcode
Remember to relaunch any processes that are already open, so that they inherit the new environment variable. On Linux,
consider modifying the \c .bashrc file (or shell equivalent) to export the environment variable every time you log in.

If you are on a Linux system and have used a package manager to install Boost, you may have to confirm where the Boost
\c include and \c library files have been placed. For example, after installing Boost with the Ubuntu Synaptic Package
Manager, the Boost \c include files are in \c /usr/include/boost, and the library files are in either \c /usr/lib or \c /usr/lib64.
The \c CMakeLists.txt file in this project defaults the \c BOOST_ROOT value to \c /usr on Linux; so, if the system is set up
similarly, no further action is necessary. If the system is set up differently, you may have to set the \c BOOST_ROOT
environment variable accordingly.

@note CMake does not recognize version numbers at the end of the library filename; so, if the package
manager only created a \c libboost_module_name.so.x.xx.x file (where x.xx.x is the version of Boost),
the user may need to manually create a soft link called \c libboost_module_name.so to the versioned
\c libboost_module_name.so.x.xx.x. See the clFFT binary artifacts in the install directory for an example.

@section IntroFFT Introduction to clFFT

The FFT is an implementation of the Discrete Fourier Transform (DFT) that makes use of symmetries in the DFT
definition to reduce the computational cost from O(\f$N^2\f$) to O(\f$ N \log N\f$) when the
sequence length, \c N, is the product of small prime factors.  Currently, there is no standard API for FFT
routines. Hardware vendors usually provide a set of high-performance FFTs optimized for their systems;
no two vendors employ the same interfaces for their FFT routines. clFFT provides a set of FFT routines that
are optimized for AMD graphics processors, and that are also functional across CPU and other compute devices.

@subsection SupportRadix Supported Radices
clFFT supports transform lengths whose only prime factors are 2, 3, and 5. This means that the vector lengths that can be
configured through a plan can be any product of powers of two, three, and five; examples include \f$2^7, 2^1*3^1, 3^2*5^4, 2^2*3^3*5^5\f$,
up to the limit that the device can support.
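
As a minimal sketch of this rule, the following C function tests whether a length contains only the prime
factors 2, 3, and 5. The name \c is_radix_2_3_5_length is hypothetical; it is not part of the clFFT API, and it
does not check the device-dependent upper limit.

```c
#include <stddef.h>

// Hypothetical helper, not part of the clFFT API: returns 1 when n is a
// product of powers of 2, 3, and 5 only, and 0 otherwise.
int is_radix_2_3_5_length(size_t n)
{
    static const size_t radices[] = { 2, 3, 5 };
    size_t i;

    if (n == 0)
        return 0;   // a zero length is never a valid transform length

    // Divide out every factor of 2, 3, and 5; a supported length reduces to 1.
    for (i = 0; i < sizeof radices / sizeof radices[0]; ++i)
        while (n % radices[i] == 0)
            n /= radices[i];

    return n == 1;
}
```

For example, the lengths 128 and 5625 (which is \f$3^2*5^4\f$) pass this test, while 7 and 14 do not.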

@subsection SizeLimit Transform Size Limits
Currently, there is an upper bound on the transform size the library supports. This
limit is \f$2^{24}\f$ for single precision and \f$2^{22}\f$ for double precision. This means that the
product of transform lengths must not exceed these values. As an example, a
1D single-precision FFT of size 1024 is valid since 1024 \f$<= 2^{24}\f$. Similarly, a 2D
double-precision FFT of size 1024x1024 is also valid, since 1024*1024 \f$<= 2^{22}\f$.
But, a 2D single-precision FFT of size 4096x8192 is not valid because
4096*8192 \f$> 2^{24}\f$.
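
The check described above can be sketched in C as follows. The function \c fits_size_limit and its parameters are
hypothetical names chosen for illustration; they are not part of the clFFT API, and the limits are the ones
quoted in this section.

```c
#include <stddef.h>

// Hypothetical helper, not part of the clFFT API. Returns 1 when the product
// of the transform lengths stays within the limit quoted in this section
// (2^24 for single precision, 2^22 for double precision), and 0 otherwise.
int fits_size_limit(const size_t *lengths, size_t dim, int is_double)
{
    const size_t limit = is_double ? ((size_t)1 << 22) : ((size_t)1 << 24);
    size_t product = 1;
    size_t i;

    for (i = 0; i < dim; ++i) {
        // Compare by division so that the running product cannot overflow.
        if (lengths[i] == 0 || lengths[i] > limit / product)
            return 0;
        product *= lengths[i];
    }
    return 1;
}
```

With this sketch, a single-precision 4096x4096 transform is accepted because its size equals the limit, while
a single-precision 4096x8192 transform is rejected.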

@subsection EnumDim Dimensionality
clFFT currently supports FFTs of up to three dimensions, given by the enum \c clfftDim. This enum
is a required parameter into \c clfftCreateDefaultPlan() to create an initial plan; there is no default for
this parameter. Depending on the dimensionality that the client requests, clFFT uses the formulations
shown below to compute the DFT.

The definition of a 1D complex DFT used by clFFT is given by:
\f[
{\tilde{x}}_j = scale \cdot \sum_{k=0}^{n-1}x_k\exp\left({\pm i}{{2\pi jk}\over{n}}\right)\hbox{ for } j=0,1,\ldots,n-1
\f]
where \f$x_k\f$ are the complex data to be transformed, \f$\tilde{x}_j\f$ are the transformed data, and the sign
of \f$\pm\f$ determines the direction of the transform: \f$-\f$ for forward and \f$+\f$ for backward. Note that
the user must provide the scaling factor, which multiplies the sum.  Typically, the scale is set to 1 for forward transforms, and
\f${{1}\over{n}}\f$ for backward transforms.
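
As a small worked example of this definition, assuming the scale acts as a multiplicative factor, take \f$n = 2\f$.
Here \f$\exp\left(\pm i \pi jk\right) = (-1)^{jk}\f$, so a forward transform with a scale of 1 gives:
\f[
\tilde{x}_0 = x_0 + x_1, \qquad \tilde{x}_1 = x_0 - x_1
\f]
A backward transform of this result with a scale of \f${{1}\over{2}}\f$ recovers the original data:
\f[
{{1}\over{2}}\left(\tilde{x}_0 + \tilde{x}_1\right) = x_0, \qquad {{1}\over{2}}\left(\tilde{x}_0 - \tilde{x}_1\right) = x_1
\f]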

The definition of a complex 2D DFT used by clFFT is given by:
\f[
{\tilde{x}}_{jk} = scale \cdot \sum_{q=0}^{m-1}\sum_{r=0}^{n-1}x_{rq}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)
\f]
for \f$j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\f$, where \f$x_{rq}\f$ are the complex data to be transformed,
\f$\tilde{x}_{jk}\f$ are the transformed data, and the sign of \f$\pm\f$ determines the direction of the
transform.  The scale multiplies the sum; typically, it is set to 1 for forward transforms and \f${{1}\over{m \cdot n}}\f$ for backward transforms.

The definition of a complex 3D DFT used by clFFT is given by:
\f[
\tilde{x}_{jkl} = scale \cdot \sum_{s=0}^{p-1}\sum_{q=0}^{m-1}\sum_{r=0}^{n-1}
x_{rqs}\exp\left({\pm i} {{2\pi jr}\over{n}}\right)\exp\left({\pm i}{{2\pi kq}\over{m}}\right)\exp\left({\pm i}{{2\pi ls}\over{p}}\right)
\f]
for \f$j=0,1,\ldots,n-1\hbox{ and } k=0,1,\ldots,m-1\hbox{ and } l=0,1,\ldots,p-1\f$, where \f$x_{rqs}\f$ are the complex data
to be transformed, \f$\tilde{x}_{jkl}\f$ are the transformed data, and the sign of \f$\pm\f$ determines the direction of the
transform.  The scale multiplies the sum; typically, it is set to 1 for forward transforms and \f${{1}\over{m \cdot n \cdot p}}\f$ for backward transforms.

@subsection InitLibrary Setup and Teardown of clFFT
clFFT is initialized by a call to \c clfftSetup(), which must be called before any other API exported
from clFFT. This allows the library to create resources used to manage the plans that are created and
destroyed by the user. This API also takes a structure \c clfftSetupData that is initialized by the
client (for example, with \c clfftInitSetupData()) to control the behavior of the library. The corresponding \c clfftTeardown() method must be called
by the client when it is done using the library. This instructs clFFT to release all resources, including
any acquired references to any OpenCL objects that may have been allocated or passed to it through the
API.

@subsection ThreadSafety Thread safety
The clFFT API is designed to be thread-safe. It is safe to create plans from multiple threads, and to
destroy those plans in separate threads. Multiple threads can call \c clfftEnqueueTransform() to place work
into a command queue at the same time. clFFT does not provide a single-threaded version of the library.
It is expected that the overhead of the synchronization mechanisms that make clFFT thread safe is minor.

Currently, multi-device operation must be managed by the user. OpenCL contexts can be created that are
associated with multiple devices, but clFFT only uses a single device from that context to transform
the data. Multi-device operation can be managed by the user by creating multiple contexts, where each
context contains a different device, and the user is responsible for scheduling and partitioning the work
across multiple devices and contexts.

@subsection MajorFormat Row Major formats
clFFT expects all multi-dimensional input passed to it to be in row-major format. This is compatible
with C-based languages. However, clFFT is very flexible in the input and output data organization it
accepts by allowing the user to specify a stride for each dimension. This feature can be used to process
data in column major arrays, and other non-contiguous data formats. See \ref clfftSetPlanInStride and
\ref clfftSetPlanOutStride.

@subsection Object OpenCL object creation
OpenCL objects, such as contexts, \c cl_mem buffers, and command queues, are the responsibility of the
user application to allocate and manage. All of the clFFT interfaces that must interact with OpenCL
objects take those objects as references through the API. Specifically, the plan creation function
@ref clfftCreateDefaultPlan() takes an OpenCL context as a parameter reference, increments the reference
count on that object, and keeps the object alive until the corresponding plan has been destroyed through
a call to @ref clfftDestroyPlan().

@subsection FlushQueue Flushing of command queues
The clFFT API operates asynchronously, and with the exception of thread safety locking with multiple
threads, all APIs return immediately. Specifically, the @ref clfftEnqueueTransform() API does not
explicitly flush the command queues that are passed by reference to it; it pushes the transform work onto the
command queues and returns the modified queues to the client. The client is free to issue its own blocking
logic, using OpenCL synchronization mechanisms, or push further work onto the queue to continue processing.

@section clFFTPlans clFFT Plans

A plan is the collection of (almost) all of the parameters needed to specify an FFT computation.
This includes:
<ul>
<li> What OpenCL context executes the transform?
<li> Is this a 1D, 2D or 3D transform?
<li> What are the lengths or extents of the data in each dimension?
<li> How many datasets are being transformed?
<li> What is the data precision?
<li> Should a scaling factor be applied to the transformed data?
<li> Does the output transformed data replace the original input data in the same buffer (or
buffers), or is the output data written to a different buffer (or buffers)?
<li> How is the input data stored in its data buffers?
<li> How is the output data stored in its data buffers?
</ul>

The plan does not include:
<ul>
<li> The OpenCL handles to the input and output data buffers.
<li> The OpenCL handle to a temporary scratch buffer (if needed).
<li> Whether to execute a forward or reverse transform.
</ul>
These are specified when the plan is executed.

@subsection Default Default Plan Values

When a new plan is created by calling @ref clfftCreateDefaultPlan, its parameters are initialized as
follows:

<ul>
<li> Dimensions: as provided by the caller.
<li> Lengths: as provided by the caller.
<li> Batch size: 1.
<li> Precision: \c CLFFT_SINGLE.
<li> Scaling factors:
    <ol>
    <li> For the forward transform, the default is 1.0, or no scale factor is applied.
    <li> For the reverse transform, the default is 1.0 / P, where P is the product of the FFT lengths.
    </ol>
<li> Location: \c CLFFT_INPLACE.
<li> Input layout: \c CLFFT_COMPLEX_INTERLEAVED.
<li> Input strides: the strides of a multidimensional array of the lengths specified, where the data is
compactly stored using the row-major convention.
<li> Output layout: \c CLFFT_COMPLEX_INTERLEAVED.
<li> Output strides: same as input strides.
</ul>

Writing client programs that depend on these initial values is <b> not </b> recommended.

@subsection EnumLayout Supported Memory Layouts
There are two main families of Discrete Fourier Transform (DFT):
<ul>
<li> Routines for the transformation of complex data. clFFT supports two layouts to store complex numbers:
a 'planar' format, where the real and imaginary components are kept in separate arrays:
<ol>
	<li> Buffer1: \c RRRRR
	<li> Buffer2: \c IIIII
</ol>
and an interleaved format, where the real and imaginary components are stored as contiguous pairs:
<ol>
	<li> Buffer1: \c RIRIRIRIRIRI
</ol>
<li> Routines for the transformation of real to complex data and vice versa; clFFT provides enums to define
these formats. For transforms involving real data, there are two possibilities:
<ul>
<li> Real data being subject to forward FFT transform that results in complex
data.
<li> Complex data being subject to backward FFT transform that results in
real data. See the Section "FFTs of Real Data".
</ul>
</ul>

@subsubsection DistanceStridesandPitches Strides and Distances
For one-dimensional data, if clStrides[0] = strideX = 1, successive elements in the first dimension are stored contiguously
in memory. If strideX is an integral value greater than 1, gaps in memory exist between each element of
the vectors.

For multi-dimensional data, if clStrides[1] = strideY = LenX for 2 dimensional data and clStrides[2] = strideZ
= LenX*LenY for 3 dimensional data, no gaps exist in memory between each element, and all vectors are
stored tightly packed in memory. Here, LenX, LenY, and LenZ denote the transform lengths clLengths[0],
clLengths[1], and clLengths[2], respectively, which are used to set up the plan.
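
The tightly packed strides described above can be computed from the lengths as in the following C sketch.
The function \c packed_strides is a hypothetical helper written for illustration; it is not part of the clFFT API.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper, not part of the clFFT API. Fills strides[0..dim-1]
// with the tightly packed strides (1, LenX, LenX*LenY) and returns the
// number of elements in one transform.
size_t packed_strides(const size_t *lengths, size_t dim, size_t *strides)
{
    size_t elements = 1;
    size_t i;

    assert(dim >= 1 && dim <= 3);   // 1D, 2D, or 3D transforms only

    for (i = 0; i < dim; ++i) {
        strides[i] = elements;      // stride of dimension i
        elements *= lengths[i];     // elements covered by dimensions 0..i
    }
    return elements;
}
```

For lengths of 16, 8, and 4, this sketch yields the strides 1, 16, and 128, and a total of 512 elements.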

By specifying non-default strides, it is possible to process either
row-major or column-major arrays. Data can be extracted from arrays of structures. Almost any regular
data storage pattern can be accommodated.

Distance is the amount of memory that exists between corresponding elements
of consecutive FFT primitives in a batch. Distance is measured in the units of the FFT
primitive; complex data is measured in complex units, and real data is measured in
real units. Stride between tightly packed elements is 1 in either case. Typically,
one can measure the distance between any two elements in a batch primitive,
be it 1D, 2D, or 3D data. For tightly packed data, the distance between FFT
primitives is the size of the FFT primitive, such that dist=LenX for 1D data,
dist=LenX*LenY for 2D data, and dist=LenX*LenY*LenZ for 3D data. It is
possible to set the distance of a plan to be less than the size of the FFT vector;
most often 1 for this case. When computing a batch of 1D FFT vectors, if
distance == 1, and strideX == length( vector ), a transposed output is produced
for a batch of 1D vectors. It is left to the user to verify that the distance and
strides are valid (not intersecting); if not valid, undefined results can occur.
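
The relationship between stride, distance, and memory offset for a batch of 1D transforms can be sketched in C
as shown below. Both functions are hypothetical helpers, not part of the clFFT API; the second one is a
brute-force check, suitable only for small sizes, that no two elements of the batch share a location.

```c
#include <stddef.h>

// Hypothetical helpers, not part of the clFFT API.

// Offset, in elements, of element `index` of 1D transform number `batch`.
size_t batched_offset_1d(size_t batch, size_t index, size_t stride_x, size_t dist)
{
    return batch * dist + index * stride_x;
}

// Returns 1 when no two (batch, index) pairs share an offset, 0 otherwise.
// Compares all pairs of elements, so it is intended only for small sizes.
int layout_is_disjoint(size_t batches, size_t len, size_t stride_x, size_t dist)
{
    const size_t total = batches * len;
    size_t a, b;

    for (a = 0; a < total; ++a) {
        for (b = a + 1; b < total; ++b) {
            const size_t off_a = batched_offset_1d(a / len, a % len, stride_x, dist);
            const size_t off_b = batched_offset_1d(b / len, b % len, stride_x, dist);
            if (off_a == off_b)
                return 0;   // two elements would occupy the same location
        }
    }
    return 1;
}
```

For four vectors of length 4, both the packed layout (stride 1, distance 4) and the transposed layout
(stride 4, distance 1) are disjoint, whereas stride 1 with distance 1 is not.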

A simple example is to perform a 1D length-4096 transform on each row of an array of 1024 rows x 4096 columns of
values stored in a column-major array, such as a FORTRAN program might provide. (This would be equivalent
to a C or C++ program that had an array of 4096 rows x 1024 columns stored in a row-major manner, and
you wanted to perform a 1D length-4096 transform on each column.) In this case, specify the strides
[1024, 1].

For a more complex example, suppose an input buffer contains a raster grid of 1024 x 1024 monochrome pixel
values, and you want to compute a 2D FFT for each 64 x 64 subtile of the grid. Specifying strides
allows you to treat each horizontal band of 1024 x 64 pixels as an array of 16 matrices of 64 x 64 pixels each,
and process an entire band with a single call to @ref clfftEnqueueTransform. (Specifying strides is not
quite flexible enough to transform the entire grid of this example with a single kernel execution.)
Create a plan to compute arrays of 64 x 64 2D FFTs, then specify three strides:
[1, 1024, 64]. The first stride, 1, indicates that the elements within each matrix row are stored consecutively;
the second stride, 1024, gives the distance between rows, and the third stride, 64, defines the
distance from matrix to matrix. Then call @ref clfftEnqueueTransform 16 times: once for each
horizontal band of pixels.
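
To make the addressing in this example concrete, the following C sketch computes where a given pixel of a given
tile lies in the grid when the strides [1, 1024, 64] are used. The function and constant names are
hypothetical and are not part of the clFFT API; the band offset is written out here only to show where each
of the 16 calls starts.

```c
#include <assert.h>
#include <stddef.h>

enum { GRID_WIDTH = 1024, TILE_SIZE = 64, TILES_PER_SIDE = GRID_WIDTH / TILE_SIZE };

// Hypothetical helper, not part of the clFFT API: offset, in pixels, of
// pixel (row, col) of tile number `tile` inside horizontal band `band`.
size_t tile_pixel_offset(size_t band, size_t tile, size_t row, size_t col)
{
    const size_t stride_x  = 1;                               // next pixel in a tile row
    const size_t stride_y  = GRID_WIDTH;                      // next row of the same tile
    const size_t dist      = TILE_SIZE;                       // next tile in the same band
    const size_t band_size = (size_t)TILE_SIZE * GRID_WIDTH;  // pixels per band

    assert(band < TILES_PER_SIDE && tile < TILES_PER_SIDE);
    assert(row < TILE_SIZE && col < TILE_SIZE);

    return band * band_size + tile * dist + row * stride_y + col * stride_x;
}
```

The last pixel of the last tile in the last band maps to offset 1024*1024 - 1, the final element of the grid.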

@subsection EnumPrecision Supported Precisions in clFFT
Both \c CLFFT_SINGLE and \c CLFFT_DOUBLE precisions are supported by the library
for all supported radices. With both of these enums the host computer's math
functions are used to produce tables of sines and cosines for use by the OpenCL
kernel.

Both \c CLFFT_SINGLE_FAST and \c CLFFT_DOUBLE_FAST are meant to generate faster
kernels with reduced accuracy, but are disabled in the current build.

See @ref clfftPrecision, @ref clfftSetPlanPrecision, and @ref clfftGetPlanPrecision.

@subsection FftDirection clfftDirection
The direction of the transform is not baked into the plan; the same plan can be used to specify both forward
and backward transforms. Instead, @ref clfftDirection is passed as a parameter into @ref clfftEnqueueTransform.

@subsection EnumResultLocation In-Place and Out-of-Place
The clFFT API supports both in-place and out-of-place transforms. With in-place
transforms, only input buffers are provided to the @ref clfftEnqueueTransform() API,
and the resulting data is written in the same buffers, overwriting the input data.
With out-of-place transforms, distinct output buffers are provided to the
@ref clfftEnqueueTransform() API, and the input data is preserved.
In-place transforms require that the \c cl_mem objects the client
creates have both \c read and \c write permissions. This follows from the nature of the
in-place algorithm. Out-of-place transforms require that the destination buffers
have \c read and \c write permissions, but input buffers can still be created with
read-only permissions. This is a clFFT requirement because internally the
algorithms may go back and forth between the destination buffers and internally
allocated temp buffers. For out-of-place transforms, clFFT never writes back
to the input buffers.

@subsection clFFTEff Batches
The efficiency of clFFT is improved by utilizing transforms in batches. Sending
as much data as possible in a single transform call leverages the parallel
compute capabilities of OpenCL devices (and GPU devices in particular), and
minimizes the penalty of transfer overhead. It's best to think of an OpenCL device
as a high-throughput, high-latency device. Using a networking analogy as an
example, it's similar to having a massively high-bandwidth pipe with very high
ping response times. If the client is ready to send data to the device for compute,
it should be sent in as few API calls as possible. This can be done by batching.
clFFT plans have a parameter to describe the number of transforms being
batched: @ref clfftSetPlanBatchSize(), and to describe how those batches are
laid out and spaced in memory: @ref clfftSetPlanDistance(). 1D, 2D, or 3D
transforms can be batched.

@section Outline  Using clFFT in a Client Application

To perform FFT calculations using clFFT, the client program must:
<ul>
	<li> Initialize the library by calling @ref clfftSetup. </li>
	<li> For each distinct type of FFT needed: </li>
	<ol>
		<li> Create an FFT Plan object. This usually is done by calling the factory function @ref clfftCreateDefaultPlan.
		Some of the most fundamental parameters are specified at this time, and others assume default values.  The OpenCL
		context must be provided when the plan is created; it cannot be changed. Another way is to call @ref clfftCopyPlan.
		In either case, the function returns an opaque handle to the Plan object. </li>
		<li> Complete the specification of all of the Plan parameters by calling the various parameter-setting functions,
		\c clfftSet_____. </li>
		<li> Optionally, "bake" or finalize the plan, calling @ref clfftBakePlan. This signals to the library the end
		of the specification phase, and causes it to generate and compile the exact OpenCL kernels needed to perform the
		specified FFT on the OpenCL device provided.

		At this point, all performance-enhancing optimizations are applied, possibly including executing benchmark kernels
		on the OpenCL device context in order to maximize runtime performance.

		Although this step is optional, most users probably want to include it so that they can control when this work is
		done. Usually, this time consuming step is done when the application is initialized. If the user does not call
		@ref clfftBakePlan, this work is done during the first call to @ref clfftEnqueueTransform.
		</li>
	</ol>

	<li> The OpenCL FFT kernels now are ready to execute as many times as needed. </li>
	<ol>
		<li>  Call @ref clfftEnqueueTransform. At this point, specify whether you want to execute a forward or reverse
		transform; also, provide the OpenCL \c cl_mem handles for the input buffer(s), output buffer(s)--unless you want
		the transformed data to overwrite the input buffers, and (optionally) scratch buffer.

		@ref clfftEnqueueTransform performs one or more calls to the OpenCL function clEnqueueNDRangeKernel.
		Like clEnqueueNDRangeKernel, @ref clfftEnqueueTransform is a non-blocking call. The commands to
		execute the FFT compute kernel(s) are added to the OpenCL context queue to be executed asynchronously.
		An OpenCL event handle is returned to the caller. If multiple NDRangeKernel operations are queued,
		the final event handle is returned.
		</li>
		<li>  The application now can add additional OpenCL tasks to the OpenCL context's queue. For example, if the
		next step in the application's process is to apply a filter to the transformed data, the application would generate
		that clEnqueueNDRangeKernel, specifying the transform's output buffer(s) as the input to the filter kernel,
		and providing the transform's event handle to ensure proper synchronization. </li>
		<li>  If the application must access the transformed data directly, it must call one of the OpenCL functions
		for synchronizing the host computer's execution with the OpenCL device (for example: clFinish()). </li>
	</ol>
	<li> Terminate the library by calling @ref clfftTeardown. </li>
</ul>

@section RealFFT  FFTs of Real Data

When real data is subject to DFT transformation, the resulting complex output
has a special property. About half of the output is redundant because it consists of the
complex conjugates of the other half. This is called Hermitian redundancy.
So, for space and performance considerations, it is only necessary to store the
non-redundant part of the data. Most FFT libraries use this property to offer
specific storage layouts for FFTs involving real data. clFFT provides three
enum values to deal with real data FFTs:

<ul>
	<li> \c CLFFT_REAL
	<li> \c CLFFT_HERMITIAN_INTERLEAVED
	<li> \c CLFFT_HERMITIAN_PLANAR
</ul>

The first enum specifies that the data is purely real. This can be used to feed
real input or get back real output. The second and third enums specify layouts
for storing FFT output. They are similar to the corresponding full complex enums
in the way they store real and imaginary components. The difference is that they
store only about half of the complex output. Client applications can do just a
forward transform and analyze the output. Or they can do some processing of
the output and do a backward transform to get back real data. This is illustrated
in the following figure.

@image html realfft_fwdinv.jpg "Forward and Backward Transform Processes"

Let us consider a 1D real FFT of length N. The full output looks as shown in
the following figure.

@image html realfft_1dlen.jpg "1D Real FFT of Length N"

Here, C* denotes the complex conjugate. Since the values at indices greater
than N/2 can be deduced from the first half of the array, clFFT stores data
only up to the index N/2. This means that the output contains only 1 + N/2
complex elements, where the division N/2 is rounded down. Examples for even
and odd lengths are given below.

The example for N = 8 is shown in the following figure.

@image html realfft_ex_n8.jpg "Example for N = 8"

The example for N = 7 is shown in the following figure.

@image html realfft_ex_n7.jpg "Example for N = 7"


For length 8, only (1 + 8/2) = 5 of the output complex numbers are stored, with
the index ranging from 0 through 4. Similarly for length 7, only (1 + 7/2) = 4 of
the output complex numbers are stored, with the index ranging from 0 through 3.

For 2D and 3D FFTs, the FFT length along the least dimension is used to
compute the (1 + N/2) value. This is because the FFT along the least dimension
is what is computed first and is logically a real-to-Hermitian transform. The FFTs
along other dimensions are computed afterwards; they are simply 'complex-to-complex'
transforms. For example, assuming clLengths[2] is used to set up a 2D
real FFT, let N1 = clLengths[1], and N0 = clLengths[0]. The output FFT has
N1*(1 + N0/2) complex elements. Similarly, for a 3D FFT with clLengths[3] and
N2 = clLengths[2], N1 = clLengths[1], and N0 = clLengths[0], the output has
N2*N1*(1 + N0/2) complex elements.
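
The element counts above can be computed as in the following C sketch. The function
\c hermitian_output_elements is a hypothetical helper, not part of the clFFT API; it assumes that
\c lengths[0] holds N0, the length along the least dimension.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical helper, not part of the clFFT API: number of complex elements
// in the Hermitian output of a real transform, where lengths[0] is N0.
size_t hermitian_output_elements(const size_t *lengths, size_t dim)
{
    size_t count;
    size_t i;

    assert(dim >= 1 && dim <= 3);   // 1D, 2D, or 3D transforms only

    count = 1 + lengths[0] / 2;     // integer division rounds N0/2 down
    for (i = 1; i < dim; ++i)
        count *= lengths[i];        // other dimensions keep their full length
    return count;
}
```

This gives 5 complex elements for a 1D transform of length 8, 4 for length 7, and 4*(1 + 8/2) = 20 for a 2D
transform with N0 = 8 and N1 = 4.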

@subsection RealModes Supported Modes

Out-of-place transforms:

<ul>
	<li> \c CLFFT_REAL to \c CLFFT_HERMITIAN_INTERLEAVED
	<li> \c CLFFT_REAL to \c CLFFT_HERMITIAN_PLANAR
	<li> \c CLFFT_HERMITIAN_INTERLEAVED to \c CLFFT_REAL
	<li> \c CLFFT_HERMITIAN_PLANAR to \c CLFFT_REAL
</ul>

In-place transforms:

<ul>
	<li> \c CLFFT_REAL to \c CLFFT_HERMITIAN_INTERLEAVED
	<li> \c CLFFT_HERMITIAN_INTERLEAVED to \c CLFFT_REAL
</ul>

@subsection ExplicitStrides Setting strides

The library currently <b> requires the user to explicitly set input and output strides for real transforms.</b> See
the following examples to understand what values to use for input and output strides under different scenarios. The
examples only show typical usages. The user has flexibility in allocating their buffers and laying out data according
to their needs.

@subsection RealExamples Examples

The following pages provide figures and examples to explain in detail the real
FFT features of this library.

@image html realfft_expl_01.jpg "1D FFT - Real to Hermitian"
@image html realfft_expl_02.jpg "1D FFT - Real to Hermitian, Example 1"
@image html realfft_expl_03.jpg "1D FFT - Real to Hermitian, Example 2"
@image html realfft_expl_04.jpg "1D FFT - Real to Hermitian, Example 3"
@image html realfft_expl_05.jpg "1D FFT - Hermitian to Real"
@image html realfft_expl_06.jpg "1D FFT - Hermitian to Real, Example"
@image html realfft_expl_07.jpg "2D FFT - Real to Hermitian In Place"
@image html realfft_expl_08.jpg "2D FFT - Real to Hermitian, Example"

 */
